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Abstract 

Motivation: Illumina BeadArray technology includes non specific negative control features that 
allow a precise estimation of the background noise. As an alternative to the background subtraction 
proposed in BeadStudio which leads to an important loss of information by generating negative val- 
ues, a background correction method modeling the observed intensities as the sum of the exponentially 
distributed signal and normally distributed noise has been developed. Nevertheless, Wang and Ye [T9| 
display a kernel-based estimator of the signal distribution on Illumina BeadArrays and suggest that 
a gamma distribution would represent a better modeling of the signal density. Hence, the normal- 
exponential modeling may not be appropriate for Illumina data and background corrections derived 
from this model may lead to wrong estimation. 

Results: We propose a more flexible modeling based on a gamma distributed signal and a normal 
distributed background noise and develop the associated background correction. Our model proves to 
be markedly more accurate to model Illumina BeadArrays: on the one hand, it is shown on two types 
of Illumina BeadChips that this model offers a more correct fit of the observed intensities. On the other 
hand, the comparison of the operating characteristics of several background correction procedures on 
spike-in and on normal-gamma simulated data shows high similarities, reinforcing the validation of the 
normal-gamma modeling. The performance of the background corrections based on the normal-gamma 
and normal-exponential models are compared on two dilution data sets, through testing procedures which 
represent various experimental designs. Surprisingly, we observe that the implementation of a more 
accurate parametrisation in the model-based background correction does not increase the sensitivity. 
These results may be explained by the operating characteristics of the estimators: the normal-gamma 
background correction offers an improvement in terms of bias, but at the cost of a loss in precision. 

Availability: The R-code to perform the background correction in this new model are available in 
the R package NormalGaimna |http://cran.r-project.org71 l. 

1 Background 

Illumina BeadArray platform is a microarray technology offering highly replicable measurements of gene 
expression in a biological sample. Each probe is measured on average of thirty to sixty beads randomly 
distributed on the surface of the array, avoiding spatial artifacts and the reported probe intensity is the 
robust mean of the bead measurements. Fluorescence intensity measured on each bead is subject to several 
sources of noise (non-specific binding, optical noise, ...). Thus the intensities produced by the microarray 
require a background correction so that it takes into account measurement error. For that purpose, Illumina 
microarray design includes a set of non specific negative control probes which provides an estimate of the 
background noise distribution. 

*to whom correspondence should be addressed 
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In genome-wide microarrays, the observed intensity of a probe is usually modeled as the sum of a signal 
and a background noise. Namely, let X be the observed intensity of a given probe, we assume that 



X = S^B (1) 

where S is the true signal which counts for the abundance of the probe complementary sequence in the 
target sample and is independent of the background noise B. Only X is observed but the quantity of 
interest is the signal S. Therefore, a background correction adjusting the effect of noise on the true signal is 
necessary to enhance the biological validity of the results. In this context the knowledge of both signal and 
noise distributions provides a background correction procedure: the signal S is estimated by the conditional 
expectation of S given the observation X = x and given the distributions of B and S. Under parametric 
assumptions on B and S*, the problem is limited to the estimation of the parameters. Besides, in many 
experimental contexts involving measurement error the normal distribution of the noise is assumed. Specific 
arguments for microarray data find their origins in analytical chemistry (see e.g. [12[ ). 

Background correction of Affymetrix and two-color microarray data has been widely developed in litera- 
ture (see 11 for a review and a comparison). Irizarry et al j5| proposed a parametric model for Affymetrix 
based on a exponential distribution of the signal, called normexp model. Several estimation procedures have 
been developed for this model. The first parameter estimation, still popular today, is the Robust Multi-array 
Average (RMA) procedure. Maximum Likelihood Estimation (MLE), incorporating the negative controls, 



has been later proposed and is considered to be more sensitive to the true parameter values (see 16 ). These 
procedures can be found in Bioconductoij^ packages including limma 17 



lUumina design differs from those of Affymetrix and two-color microarrays by including a set of negative 
probes which do not specifically target any regular probe. Aside from non specific hybridization, these 
negative probes do not hybridize and then have signals close to zero. Thus their observed intensity is 
X = B. As all probes from a given array correspond to the same biological sample and are subject to the 
same technical steps during the analysis process, the noise is generally assumed identically distributed on an 
array and the negative probes provide a sample from its distribution. 

The background correction implemented in lUumina BeadStudio software is the subtraction of the esti- 
mated mean of the negative probe distribution. However, it creates a large amount of probes with negative 
intensities unusable in further analysis. The deletion of these probes is considered in some studies as an op- 
portunity to gain statistical power when the number of strongly differentially expressed genes is large, but it 
can lead to an important loss of information. Ding et al ^2] illustrate this phenomenon in their mice leukemia 
study: a large amount of corrected values are negative only in one group suggesting that the corresponding 
probes have discriminating ability. This issue is confirmed by Dunning et al 3| on spike-in data. 

To avoid this problem, parametric models have been used on lUumina data with parameter estimations 
taking into account the specific design of lUumina microarrays. In this context, the normexp model has 
been first adapted. Ding et al use the Maximum Likelihood Estimation (MLE) based on a Monte-Carlo 
Markov chain approximation and compare their method to an lUumina-adapted RMA procedure using an ad 



hoc rule of thumbs to estimate the parameters. Xie et al 20 go into details in normexp method comparison 
on experimental and simulated data. Lin et al f?] present a variance stabilizing transformation (VST) on 
a model involving both additive and multiplicative noises, which simultaneously denoise and transform the 
data. Replacing the classical log-transformation, VST produces less directly interpretable results and tends 



to produces very small fold changes, as underlined by Shi et al 15 who propose an original approach 
to compare methods offering different bias-precision trade-off by aligning the innate offset generated by 
each pre-processing strategy. They conclude in favor of the normexp model with robust 'non-parametric' 
parameters associated to a quantile-normalization using control and regular probes. 

The spread of each background correction among the lUumina users is hard to evaluate since many authors 
do not mention precisely the pre-processing steps performed in their study. Nevertheless, the normexp model 
that will be especially examined in this paper is included in several widely used packages such as lumi and 
limma of Bioconductor^. 



"^http : //www.bioconductor . org 
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Despite its popularity the normexp model does not properly fit Illumina microarray data. This issue 



has been raised by Wang and Ye 19 , who estimate the density of the signal on an Illumina microarray 
with a kernel-based deconvolution procedure. The shape of the estimated signal density does not present 
the characteristics of an exponential distribution and a gamma modeling seems more appropriate. In this 
paper, we emphasize that the normal-exponential model is not flexible enough to model the signal-noise 
decomposition on Illumina microarrays by showing that the distance between the reconstructed density 
from the estimated parameters and the distribution of the observed intensities is large. 

We propose an alternative model thereafter called "normal-gamma model" which addresses this lack 
of fit. In our model, the normal noise distribution is assumed and the signal on one array is assumed to 
be gamma distributed. As the exponential distribution is a special case of the gamma distribution, this 
model extends the normexp model. The potential of such generalization was already suggested by Xi ete 



al 20 in their discussion. We derive the necessary estimation procedure by likelihood maximization. The 
good quality of fit is attested on two types of Illumina microarrays. The associated background correction 
is compared to methods based on the normexp model in terms of quality of estimation of the signal and 
checked for robustness on simulated data. The characteristics of the background correction procedures are 
compared on a set of spike-in data, and a parallel is drawn with the same characteristics studied on normal- 
gamma simulated data. Finally, the normexp and normal-gamma background corrections are compared on 
two dilution data sets. 

The paper is organized as follows. The experimental and simulated data sets are described in Section [2] 
The methods are presented in Section [3j the notations and the general model-based background correction 
formula are gathered in Section |3.1[ the previous models developed for Illumina microarray background 
correction, including the normexp model, are summarized in Section [3.2[ Section [3 . 3| presents the proposed 
alternative parametric model built with normal noise and gamma distributed signal, as well as a parametric 
estimation procedure and its associated background correction. In Section [4] the performances of this new 
model are evaluated on simulated, spike-in and dilution data sets. The results and perspectives are discussed 
in Section[5] The tables and figures are gathered at the end of the article. Supplementary material is provided 
in Appendix. 



2 Materials 

2.1 Experimental Data sets 

• (El) Nov^fac data [S]. The gene expression profile in peripheral blood from ten controls in the 
Norwegian Woman And Cancer study has been analysed on Human HT-6 v4 Expression BeadChips. 
The whole probe set including 48,000 bead types has been considered, as well as a restricted set of 
25,000 bead types according to Illumina annotation files. Details on laboratory experiments are given 
in Supplementary Material (SM), Section 1.1. 

• {E2) Leukemia mice data 2 . Total RNA from samples of spleen cells from four mice have been 
analysed on Mouse-6 vl BeadChips. Experiment description and data are available in [2] 



{E3) Spike-in data 10 . Human WG-6 v2 BeadChips have been customized to include 34 bead 
types, refered as 'spikes', whose corresponding target sequence is absent from the human genome in 
addition to the 48,000 regular probes. The 34 spikes were introduced at 12 different concentrations 
(0pm, 0.01pm, 0.03pm, 0.1pm, 0.3pm, 1pm, 3pm, 10pm, 30pm, 100pm, 300pm, 1000pm) in a human 
biological sample. Each sample corresponding to a spike concentration has been analysed on four 
arrays. The data are available at littp://rafalab.jhsph.edu/, 

{E4) MAQC data [l]. Two pure samples. Universal Reference RNA (HBRR) and Human Brain 
Reference RNA (UHRR) were mixed in four different proportions (100%/0%, 75%/25%, 25%/75%, 
0%/100%). Five replications of each sample have been analysed on Human WG-6 vl BeadChips. The 
data are available on GEO (access number GSE5350). 
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• {E^) Dilution data [9]. The pure samples UHRR and HBRR were mixed at different proportions 

(100%/0%, 99%/l%, 95%/5%, 90%/10%, 75%/25%, 50%/50%, 25%/75%, 10%/90%, 0%/100%). Each 
mixed sample has been analysed with four different starting RNA quantities (250ng, f OOng, 50ng, f Ong). 
Six Human WG-6 v3 BeadChips were used. 

2.2 Simulated data sets 

For each data set, N — 100 random arrays including a vector of length n^cg = 25000 corresponding to 
the regular probe intensities and a vector X°'^ of length n^cg — 1000 corresponding to the negative probe 
intensities are generated. The values of the nine parameter sets as well as the details of the simulations are 
given in SM, Section 1.2. 

• (Si) Normal-gamma and normexp models. For each repetition £ = 1, . . . , N , 'X.^ is generated as 
the sum of a gamma and a normal-distributed sample, and X*^'^ is drawn from a normal distribution. Six 
sets of parameters are computed from two microarrays in data sets and {E2), based on normexp 
and normal-gamma models in order to get realistic values (sets 1-6). The normexp parameters are 
actually degenerated normal-gamma parameters where the shape is equal to 1. 

• (52) Mixture noise distribution. A mixture of normal and distributions with different propor- 
tions (0, 0.1, 0.25, 0.5, 0.75,1) is considered for the background noise. These distributions model a 
departure from normality with a heavier right tail for larger values of p. The mixture densities are 
presented in SM, Section 7.4. The signal is generated from a gamma distribution with parameter values 
from set 1. 

• (53) Replicates. We mimic replicate measurements from a biological sample by simulating N arrays 
with the same signal sample generated from a gamma distribution. The background noise and negative 
probe intensities are independently drawn from a normal distribution for each array. The values of the 
parameters are computed from the first array in (set 7). Replicates from the normal-exponential 
model are drawn in the same way with parameter values estimated on the same array with two normexp 
estimates (sets 8 and 9). 

• (54) Replicates with empirical background noise. Similarly to {S3), the signal drawn from a 
gamma distribution with parameter values from set 7 is identical on each array. In order to get a 
realistic noise distribution, the negative probe and background noise intensities are sampled from the 
global set of quantile-normalised negative probe intensities measured in the experimental data set (-E3). 

3 Methods 

3.1 General model-based background correction formula 
3.1.1 Notations. 

Throughout this article, the background correction is processed on one single array corresponding to one 
biological sample. For a given probe j, we denote by Xj the observed intensity, Sj the non-observable un- 
derlying signal and Bj its background noise. For a negative control probe, Sj is assumed to be 0. Let J and 
Jo be respectively the index of regular and negative probes on the array. We denote by fx , fs and fs the 
densities of respectively the observed intensity, the unknown signal of interest and the background noise. 

We denote by /""J™ the density of the normal distribution with mean /i and variance and by 4> and 
$ the density and cumulative distribution function of the normal distribution with mean and variance 
1. We denote by = (l/a) exp(— x/a) the density of the exponential distribution with mean a and 
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/ffc™ ~ exp {—x/9) I (Q^Tiky) the density of the gamma distribution with scale parameter B and shape 
parameter fc. The exponential distribution is a special case with fc = 1 and Q = a. 

Given a parametric density and a procedure of estimation of its parameters, we call plug-in density, this 
density for the estimated parameters. 



3.1.2 Model-based background correction 

The model based background correction (BgC) incorporates information from both signal and noise distri- 
butions. Under the additive model ([T]) assuming independence of S and B, fx is the convolution product 
of fs and Jb- For an observed probe intensity x, the signal S is estimated by the conditional expectation of 
S given the observation X = x and the densities fs and fs (more details can be found in 20 ): 

Six) = E[S\X^x,fBjs]^ / /^^^y^/^-'^ d. 

J fx{x) 

sfs{s)fB[x~s)dsl j fs{s)fB{x-~s)ds. (2) 



3.2 Previous modelings 



3.2.1 The normal model for negative probes 

The design of Illumina BeadArrays provides a sample of the background distribution through the negative 
probes. We have compared the density histogram of negative probes to the plug-in normal density fj^°™ 
obtained by using robust estimators of the parameters on data sets (Ei) and (i?2)- 

The results of this comparison are presented in SM, Section 7.2.1. As expected, the empirical distribution 
is essentially normal but we notice a slightly heavier right tail. It may be interpreted as intensities of wrongly 
designed negative probes which partially hybridize with some material present in the biological sample. 



3.2.2 The normal-exponential model 

The normexp model is a parametric model for the noise-signal decomposition on one array. We recall it 
briefly and refer for example to pO] for more details. For every probe j, 



X, = S, + B 



Exp (a), ifjeJ, 
if j e Jo, 



Bj^J^{fi,a^), 
Sj _L Bj, 

where _L denotes the independence between variables. The parameters {fi, cr, a) depend on the given array. 

For computational reason, the A'j's are usually and often implicitly assumed to be independent. The 
existence of pathways between genes violates this assumption. Nevertheless, as a small proportion of genes 
are involved, results are reliable. According to the convolution structure (see Section 3.1.2), the density of 
the Xj's is: 

f;i:^'Lix) = - exp (f^ - ^) $ (x) , (3) 
^' a \2q;-^ a J 

where x = (x — ^ — a'^/a)/a. Denoting Q = (/i, cr, a), from ([2|, the background corrected intensity for an 
observed intensity x is: 



5-P(x|e) = a(x+J^). (4) 
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3.2.3 Normal-exponential model fit 

We consider the data sets (Ei) and {£2). For each array, we apply the following procedure: 



• Computation of the estimators (/i, a, a) of the normexp model with the methods described in 20 

1. Maximum Likelihood Estimation (MLE) using both regular and negative probes, 

2. Robust Multiarray Analysis (RMA) adapted from Affymetrix method, 

3. NP estimation obtained by the method of moments applied to negative and regular probes, 

4. Bayesian estimation. Note that the bayesian estimation results are not presented as they are 
nearly identical to MLE, as pointed out by Xie et al [20|. 

• For each parameter estimation method, plot of the plug-in density f^^~\- 

• Plot of an irregular density histogram of all regular probe intensities of the array using the R-package 
histogram available on the CRAN with default irregular setting (see [l3]). Even though adaptive 
irregular histograms are not commonly used to describe microarray data, they have been proved to 
offer a better approximation in a general framework. Moreover, they appear especially relevant to 
estimate microarray distributions which present high irregularities. 

Figure [6] shows the results for this procedure on one array from {Ei) after removal of imperfectly designed 
probes (more arrays are presented in SM, Section 7.2.2). Apart from the RMA method, the estimated density 
does not fit the density histogram and even the RMA estimator is not satisfying from a statistical point 
of view. One can remark that RMA underestimates the high expressions while the other methods tend to 
overestimate their contributions. 



Besides Xie et al 20 show that the MLE and the NP estimation provide satisfying estimators of the 
parameters on normexp simulated data. Thus the difference between the histogram of the observed intensities 
and the plug-in density does not come from a poor estimation of the parameters but results from an unsuitable 
parametric model. 

3.3 A new modeling: The normal-gamma model 

The poor fitting quality of the normexp model shown above calls for a more suitable parametric model 
for lUumina BeadArrays. According to Section [3.2| the normal assumption for the negative probes appears 
relevant. We consider the gamma distribution as an extension of the exponential distribution to model the 
signal intensities. Besides, as a scale mixture of exponential distributions (see f4l), the gamma distribution 
is a natural generalization which helps to take into account different probe hybridization behaviors which 
could count for different exponential life times. This defines a more flexible parametric model called the 
normal- gamma model that we propose to apply to lUumina BeadArrays. 

3.3.1 The normal-gamma model 

The normal-gamma model is defined as follows. For every probe j: 



r(0,fc), ifjej, 
if j e Jo, 

Sj ^ B,. (5) 

The parameters (/i, a, fc, 9) depend on the given array. This model offers more flexibility than the normexp 
model but requires the estimation of one more parameter. 
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According to the convolution structure (see Section 3.1.2), the density of Xj is the convolution product 
of the densities of Sj and Bj , namely: 



This density does not have any analytic expression as the normexp density ([s]). Nevertheless, good and 
fast numerical approximations can be computed using the Fast Fourier Transform (fft) and some tail 
approximations to ensure stability. Our implementation based on fft is detailed in SM, Section 7.6. 

3.3.2 Parameter estimation in the normal-gamma model 

The parameters (/i, cr, fc,0) of the normal-gamma distribution are estimated by the Maximum Likelihood 
Estimator (MLE): 

{p,a,k,e) = aig max i ((/i, cr, /c, 6l)|X, X°) (7) 

where 

L((/.,a,fc,0)ix,xo) = ncM(^.) • n /"r(^,) 

is the likelihood from the two sets of observations X = {Xj,j e J} and X" ~ G Jo} measured on 

regular and negative probes, respectively. Thanks to the f f t-based approximation of /"^ ^ g the maximum 
likelihood estimation can be numerically computed using classical minimization algorithms (see SM, Section 
7.6). 

3.3.3 Background corrected intensity for the normal-gamma model 

Denoting now 8 = {n, cr, k, 6), we derive from ^ the background corrected intensity for an observed intensity 
x: 

= (8) 

using the equality sf^^™{s) — kO ff^^ g{s) valid for every s > 0. This formula allows ff t-based computations 
for the background correction. 

3.3.4 Inference of negative probes from Illumina detection p-values 

Most publicly available data sets do not present the negative probe intensities. Nevertheless, for each regular 
probe, Illumina provides a detection p- value equal to the proportion of negative probes which have intensities 
greater than that probe on a given array. Following the idea from Shi et al jl4| we propose to infer the 
negative probe intensities from the detection p-values (see details in SM, Section 7.7.1). For the normexp 
and the normal-gamma models, the estimates of the parameters and reconstructed signals obtained with the 
true and inferred negative probe intensities are compared on the ten arrays from {Ei). We observe that the 
error resulting from inference of the negative probe is negligible, with a relative error of order 10^'^ to 10^"* 
on parameter estimation and 10""* to 10~^ on signal estimation (see SM, Section 7.7.2). 
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4 Results 



4.1 Fit on Illumina Bead Array data 



Similarly to Section |3.2[ we compare the irregular density histogram of the regular probe intensities, the 
plug-in normexp densities using RMA, MLE, NP methods, and the plug-in normal-gamma density using a 
Maximum Likelihood Estimate of {^,a,k,9) on the data sets {Ei) and (£2). The results, similar along the 
arrays, are illustrated in Figure [2] on one array from (Ei) (more plots are presented in SM, Section 7.2.2). 
We do not add the MLE and NP plug-in density estimates for the normexp model which have already been 
shown not to fit the data. 

Thanks to the larger flexibility of the normal-gamma model, we observe that the distance between the 
MLE plug-in normal-gamma density and the histogram of the intensities is smaller than the correspond- 
ing distance using the normexp model with any estimation procedure. This graphical result is confirmed 
numerically using the ii-distance between the histogram and the reconstructed density defined by 

hiLh)^ J \f{x)~-h{x)\dx, (9) 

where / represents one plug-in density estimate using either the normexp model or the normal-gamma 
model and h represents the irregular density histogram obtained by using the R-package histogram. Table 
[1] presents the mean of the relative deviation for the normexp estimators with respect to the deviation for 
the normal-gamma estimator: 



mean 



where fi is a normexp estimator of the regular probes density, and is the normal-gamma estimator for 
individual i. The mean is computed over the ten arrays from (Ei) (with and without the non specific binding 
probes) and over the four arrays from (i?2)- 

The mean absolute deviation is 3 times smaller in favor of the normal-gamma density with respect to the 
normexp density using the RMA estimate, and 4 to 8 with respect to the normexp model using the MLE or 
NP estimates. 



4.2 Quality of estimation on simulated data 

The quality of estimation of the normal-gamma model is assessed on simulation the data set (5*1). The 
first two sets of parameters are non degenerate normal-gamma parameters, more realistic for modeling 
Illumina microarrays as shown in Section [4.1[ They are used to evaluate the MLE normal-gamma parameter 
estimation and validate the associated background correction, and to quantify the improvement brought by 
the new normal-gamma background correction. The last four sets are actually degenerate normal-gamma 
parameters where the shape parameter k is set to 1, corresponding to normexp data, which enables to assess 
the potential loss of precision in parameter estimation brought by a more flexible modeling. 



4.2.1 Parameter estimation 

For each repetition i = l,...,iV we compute the normal-gamma MLE {11^,a^,k^,9^) of the parameters. 
Table [2] presents the relative Li-error for each parameter (3 e {/i, cr, fc, 6*}: 

£=1 

The parameter estimation is of excellent quality for the gaussian distribution and of good quality for the 
gamma distribution. 



8 



To check wether the introduction of a fourth parameter in our model leads to a loss of precision in the 
parameter estimation, we compare the relative Li errors of the MLE parameter estimation in the normal- 
gamma and normexp models using the parameter sets 3 to 6, corresponding to normexp data. The results 
summarized in Table [3] indicate that the quality is unchanged for the variance parameter and that we pay a 
price of order 2 for /i and 6. Nevertheless, since the relative errors in these cases have order 10~^, this loss 
is negligible. 



4.2.2 Background corrected intensity 

We now study the performance of the normal-gamma background correction (BgC) obtained in (|8l with 



respect to the existing BgC methods (see 20 ) in terms of quality of estimation of the signal on the simulated 



data set (5*1). We compare the following BgC methods: 

0. Normal-gamma BgC in Q with true parameters, 

1. Normal-gamma BgC in Q with MLE parameters, 

2. Normal-exponential BgC in Q with MLE parameters (referred to as normexp-MLE), 

3. Normal-exponential BgC in Q with RMA parameters (referred to as normexp-RMA), 

4. Normal-exponential BgC in Q with NP parameters (referred to as normexp-NP), 

5. Background subtraction: 

5™'^(x) = max (x - median{Xj ,j e Jo},0). 

These methods are further denoted by S^^^ for i = 0, . . . , 5. For methods 1 to 4, the BgC is a two-step 
procedure: the parameters are estimated and then plugged respectively into ([s]) for method 1, and into Q 
for methods 2 to 4. From a practical point of view, as the parameters are unknown, S^^'^ is unavailable. 
Nevertheless, as the result of a procedure with a perfect first estimation step, it allows a comparison to 
quantify the performance of the second step. 

For each parameter set and for each BgC method S = 5*^'^ , i = 0, . . . , 5 we compute the Mean Absolute 
Deviation (MAD): 

MAD(5) = -E|—E 1^(^11©^ 



where Qi = {fie,ae,ke,9e) denotes for each simulated array £ the estimated parameters corresponding to 
the used methods with the following conventions: 1/ is the true parameters for i — 0; 2/ fc^ = 1 for 
? = 2, . . . , 4 corresponding to the exponential distribution; 3/ Qi represents the median over X''''^ for i — 5. 
Simulation results are summarized in Table [4] for the parameter sets 1-6. The five first columns correspond 
to the excess risk ratio: 

i?(i) = MAD(5(''))/MAD(5(")), fori = l,...,5 (10) 

and the last column indicates the reference risk MAD(S'^°'). 

The normal-gamma BgC provides the same quality when the parameters are known or estimated. This 
holds when the data are generated either from a normal-gamma or a normexp model. Normexp-NP shows 
good behaviors when the data come from a normexp model but has a risk increase of order 60% if the data 
come from a normal-gamma model. Normexp-MLE provides good results for normal-exponential data but 



fails when the data come from a normal-gamma model. Not surprisingly, as already pointed by Xie et al 20 
normexp-RMA has a poor behavior. The background subtraction method with a maximal quality loss of 
32% offers an acceptable alternative in terms of risk. Indeed, most of the intensities being small, putting 
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them to does not affect significantly the MAD value. Let us recall, however, that from a practical point of 
view, the main default of that BgC is to eliminate a huge number of probes. 

In practical experiments, the data are usually transformed before the analysis. To address this issue, the 
MAD is computed on log-transformed intensities (see SM, Section 7.3 for details), and the excess risk ratio is 
displayed in Table [Sj The normal-gamma BgC presents a smaller error of estimation than the methods based 
on the normexp model. The MAD from normexp-MLE generates the highest value, and normexp-RMA and 
normexp-NP show a a similar moderate excess risk. Nevertheless the differences between the BgC methods 
are less pronounced than the one observed at the raw scale. However, with an excess risk ratio between 
16% and 33%, the normexp methods notably under perform the BgC based on the normal-gamma model in 
terms of signal estimation. 

The MAD computation offers a global comparison of the various BgC methods in terms of signal esti- 
mation. We refine this analysis by examining the absolute deviation (AD) of the estimated signal for each 
signal intensity at the raw and the log scales, respectively defined as: 

AD{S,)^-Y,\S{X'^^\Q,)-S'^ 

1=1 

1 ^ 

AD{\og{S,)) = -Y. [s{X'j\^S) - log {S]) 
1=1 

The first row of Figure [3] displays the logarithm of the AD at the raw scale, as a function of the log- 
signal intensity. We observe that normexp-MLE presents a larger AD for all values of the signal. On small 
intensities, the normal-gamma BgC outperforms the other methods, whereas normexp-RMA and normexp- 
NP present a smaller deviation on moderate intensities. For high values of the signal, normal-gamma shows 
a smaller error of estimation together with normexp-NP. 

The absolute deviation on log-transformed intensities is presented on the second row of Figure |3] The 
normal-gamma BgC still presents the smallest error of estimation on weak intensities, but is outperformed by 
the other methods on moderate intensities. The four methods present similar AD values on high intensities. 
Besides, we observe than the error of estimation for all methods increases as the signal become weaker. 



4.2.3 Robustness 



In Section [3.2[ we have underlined the slightly heavier right tail of the negative probe distribution. To en- 
sure that the estimation remains acceptable under the assumption of an imperfect noise parametrisation, we 



compare the robustness of the normal-gamma method with normexp-NP, stated as the most robust by 20 



and which we found competitive (see Section 3.1.2). The errors of estimation computed from the simulation 
data set (^2) are presented in SM, Section 7.4. Both methods are robust with respect to non-normal noise 
distribution, and the normal-gamma BgC still offers a better quality of estimation than normexp-NP when 
the noise distribution departs from normality. 



In conclusion, the normal-gamma background correction globally offers a better quality in signal estima- 
tion with respect to the normexp methods. Nevertheless, this improvement depends on the scale considered 
and does not steadily hold over the range of intensities. 



4.3 Operating characteristics 

Beyond the quality of estimation of the signal, the performance of a BgC procedure in practical experiments 
depends on its characteristics in terms of bias and variance. In this section, we compare the operating 
characteristics of the normal-gamma and normexp BgC both on simulated and spike-in data. The results 
are gathered in Figure |4] The background subtraction leading to probe deletion is not further considered. 
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The data from (E^) are background corrected with the methods 1 to 4 described in Section 4.2.2 Quantile 
normahzation based on both regular and negative probe intensities is apphed, followed by log-transformation. 
The same procedures are implemented on the simulation data sets (5*3) and (5*4). 



4.3.1 Bias-precision trade-ofT 

The quality of a pre-processing method in microarray experiments can be characterised by its ability to dis- 
tinguish between distinct values of the signal. Most of the procedures underestimate the signal fold-changes. 
This bias in fold-change estimation, called compression, has a negative impact on differential analysis. But 
the efficiency of a pre-processing method also depends on its precision, characterised by the variations of the 
corrected intensity for a given value of the signal. The trade-off between bias and precision is an indicator of 
the performance of a procedure. This issue can be understood by the example of a t-test statistic for a given 
probe differentially expressed between two groups: an important compression attenuates the difference of 
average intensities between the two groups, whereas a poor precision generates a high variance term in the 
denominator, which reduces the value of the test statistic. 

The compression and precision obtained with the four BgC methods on the data set {E^) are presented 
on the first row of Figure |4] The first column displays the average intensity over the 34 spike bead types 
for each spike concentration. A similar saturation effect is observed for large concentrations with the four 
methods: for concentrations larger than 100pm, the relationship between log-intensity and log-concentration 
is not linear. Moreover, as the concentration decreases, a compression of the signal is observed with all the 
methods, but is significantly less pronounced with the normal-gamma BgC. 

The second column presents the average standard deviation between replicates over all spike bead types. 
We observe that the improvement in bias brought by the normal-gamma model is at the cost of a poorer 
precision. More generally, the precision increases with the compression for the four methods. 



4.3.2 Innate offset 



Shi et al 15 highlight the role played by the "innate offset", defined as the typical intensity assigned to the 
non-expressed genes by a pre-processing procedure, in the unbalanced bias-variance trade-off of the various 
BgC methods: the strategies which show the smallest innate offset usually generate less bias but present a 
poorer precision. On spike-in data sets, the innate offset is defined as the mean of the intensities measured 
on spike probes with concentration zero. The results displayed in Table |6] confirm the observations from 



Shi et al 15 : as underlined above, the normal-gamma BgC exhibits the smallest precision associated with 
the smallest bias with a slope from the linear regression of intensities on log-concentrations close to 1. The 
largest offset combined with the highest precision and the largest bias are observed for normexp-MLE. 



Shi et al 15 propose to compare the pre-processing methods in a more equal way by adding an offset 
to the background-corrected quantile-normalised intensities before log-transformation, in order to align the 
innate offsets of the various pre-processing strategies. Our results presented in SM, Section 7.5.1, indicate 
that the characteristics between the four BgC present more similarity after equalizing the innate offsets, 
but a slight difference remains between normexp-MLE and the other BgC methods on small intensities. 
Nevertheless, prior to the offset equalisation, the methods studied in this paper do not present the large 



range of bias-precision trade-offs observed in the pre-processing strategies considered in 15 . In this context, 
the equalisation of the innate offsets does not appear sufficient to completely erase the differences between 
BgC methods. 



4.3.3 Operating characteristics on simulated data 

In order to reinforce the validation of the normal-gamma parametrisation for the noise-signal distribution, we 
compare the operating characteristics obtained on spike-in data to the ones provided by the normal-gamma 
simulated data from set {S3). The spike concentration, used as references to assess the bias and precision 
of the procedures on spike-in data, is replaced by the true value of the signal. The results are displayed 
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on the second row of Figure |4] The first column presents the average intensity as a function of the signal 
log-intensity, and the standard deviation of the rephcates is shown in the second column. The trends are very 
similar to the ones observed on spike-in data, with a small difference for the variance with normexp-RMA. 
Besides, we observe that the compression in small intensities generated by the four BgC methods is purely a 
statistical effect. However, the signal attenuation in high intensities observed on spike-in data is not present 
on simulated data. Indeed, it has already been suggested that this phenomenon could come from saturation 
in light intensity on microarrays. 

Furthermore, we address the departure from normality observed on the negative probe distribution by 
simulating microarrays with a gamma distributed signal and a non-normal background noise (data set (<S'4)). 
In order to get a realistic noise distribution, the background noise and the negative probe intensities are 
sampled from the quantile-normalised negative probe intensities from all arrays in {E3) (see details in SM, 
Section 7.1.2). The operating characteristics of the four BgC methods are presented on the third row of 
Figure |4] We observe that the slight difference between normal-gamma simulations and spike-in data with 
normexp-RMA, observed on the second row of Figure |4] is partially corrected by generating a non-normal 
background noise. 

The same quantities are computed based on normal-exponential simulated data with parameter sets 8 
and 9. The results are displayed on the fourth row of Figure |4] for set 9, and on Figure E in SM for set 8. 
The comparison between the operating characteristics of the four BgC methods are absolutely not consistent 
with the observations from the spike-in data. In particular, the normal-exponential simulated data provide 
almost identical bias and precision curves for normal-gamma, normexp-NP and normexp-MLE, whereas 
these methods exhibit notable differences on spike-in data. 

The parallel drawn between the operating characteristics of the four BgC methods on spike-in and simu- 
lated data confirms that the gamma model represents a much more accurate parametrisation for the signal 
distribution than the usual exponential model. 



4.4 Differential expression analysis 

The BgC methods are compared from a practical point of view through a differential expression analysis 



performed on the dilution data set (-E4), based on the hierarchical linear model approach from Smyth 18 
implemented in the limma package. This procedure provides p-values from a moderated t-statistic. A first 
analysis is run on the two pure samples (proportions 100%/0% and 0%/100%) to define the "true" differ- 
entially expressed (DE) and non-differentially expressed probes. A second differential expression analysis 
performed on the two mixed samples (proportion 75%/25% and 25%/75%) is used to assess the performance 
of the BgC methods. The moderated t-test statistic implemented in the limma package includes a variance 
term representing the variation of the gene intensity across all arrays, as well as hyperparameters computed 
from the whole data set intensities. Therefore, in order to get independent results, the two analyses are 
performed on separate linear models. 

The estimate proportion of DE probes in pure samples computed with a convex decreasing density 
procedure [6] is 28% for all methods. Thus, in order to be conservative, we define the probes with the 20% 
smallest p-values as "true DE", and the probes with the 40% highest p-values as "true non-DE". Moderated 
t-statistic values are then computed from the comparison of the two mixed samples. The p-values from 
true DE and non-DE probes are ordered. The area under the ROC curve (AUC) is used to quantify the 
sensitivity of each BgC method, the largest value of the AUC corresponding to the highest sensitivity. The 
four methods present similar AUC values but the normal-gamma BgC is slightly less competitive. 

A similar analysis is run with the addition of an offset prior to log-transformation. Figure [5] displays the 
AUC values as a function of the added offset with each BgC method. The values observed for an offset equal 
to zero correspond to the sensitivity when a simple log-transformation is applied. As already highlighted 



by Shi et al 15 , we observe that the addition of a moderate offset increases the sensitivity of all BgC 



methods. For any value of the offset, the normexp methods outperform the normal-gamma. The results 
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obtained with the different normexp BgC methods are very similar, but we note that the highest sensitivity 
is achieved with normexp- RMA for offsets smaller than 50, and with normexp- MLE for offsets larger than 50. 

The BgC methods can also be compared regarding their ability to order a set of measured intensities 
corresponding to increasing or decreasing probe concentrations. This framework can refer, for example, to 
a longitudinal study where the gene expression is repeatedly measured at different times. The correlation 
between the mixture proportion and the intensity is analysed on the dilution data set (-E5). For the true DE 
probes, the intensity is expected to be increasing or decreasing with the proportion. 

The dilution data sets {E4) and (E^) are based on the same pure biological samples. Therefore, true DE 
and non-DE probes defined on {E4) can be considered in the analysis of the data from (-E5). The BeadChips 
used in experiments (-B4) and (E^) are different, but some bead types are present on both devices. By 
mapping the annotation files from both BeadChips, the sets of probes respectively defined as DE and non- 
DE on (£^4), and present on (£'5) are extracted. 

For each probe, the Spearman correlation coefficient is computed between the vector of mixture propor- 
tions and the observed intensities. This provides a test statistitic based on the ranking of the background 
corrected intensities, which allows a comparison of the BgC methods independently from the scale at which 
the data are analysed, provided that the transformation applied to the data is increasing. In particular, 
the results are not affected by the addition of an offset. The correlation coefhcient is computed separately 
on microarrays with starting RNA quantities 250ng, lOOng, 50ng and lOng. The coefficient is expected to 
be close to 1 in absolute values for the DE probes, and close to for the non-DE. The probes are ranked 
according to their correlation coefficient value, and the resulting AUCs for each starting RNA quantity are 
displayed in Table [7| We observe that the normal-gamma BgC is slightly but steadily less sensitive that the 
other methods. The AUCs observed with the different normexp estimates are very similar and do not allow 
to assess the superiority of one method over the others. 

5 Discussion 

In many microarray experiments, background noise correction is an important issue in order to improve the 
measurement precision. Model-based background correction procedures have been developed as an alterna- 
tive to the default background subtraction from Illumina BeadStudio which has proved to remove informative 
probes. The usual normal-exponential model considered for the noise-signal distribution has already been 
pointed out as inappropriate for Illumina BeadArrays [l9]. Our observations confirm this result by high- 
lighting the poor fitting of the normexp reconstructed densities on observed intensities with three different 
parameter estimates . We propose an alternative model based on a more flexible parametrisation of the 
signal which is assumed to follow a gamma distribution, as well as the associated background correction. 
The reconstructed density offers a better fit of the distribution of the observed intensities, validating this 
new model as more appropriate for Illumina microarrays. 

We compare the performance of the background correction procedures based on the normal-gamma and 
normal-exponential models, on simulated and experimental data sets. Our simulation study indicates that 
the normal-gamma model brings an overall improvement in terms of signal estimation, characterised by a 
smaller average difference between the true signal and the background corrected intensity. But surprisingly, 
the differential expression analysis run on two dilution data sets shows that the improvement in terms of 
parametrisation does not have a positive impact on practical experiments. This result may be explained in 
two ways. 

On one side, the operating characteristics of the background correction procedures are compared on a set 
of spike-in data, which allow to connect the probe intensity with the concentration of the target gene in the 
biological sample. We note that the normal-gamma model generates less bias than the normexp methods, 
but at the cost of a loss in precision. With the addition of an offset prior to the log-transformation, which 
provides balance in the bias-precision trade-off of the different methods, the operating characteristics appear 
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similar, suggesting comparable performance. 

On the other side, we examine the error in signal estimation as a function of the signal on log-scale 
simulated data. The normal-gamma model outperforms the other methods on small intensities, but is less 
competitive on moderate intensities. Due to the marked compression of the recovered intensity when the 
signal decreases, the improvement in terms of signal estimation for the small intensities has a weak effect on 
the differential expression analysis. Thus, the smaller average error of estimation observed with the normal- 
gamma background correction does not result in a higher sensitivity in practical experiments. 

Besides, the parallel drawn between the operating characteristics of the different background corrections 
obtained, on the one hand with spike-in data and on the other hand with normal-gamma simulated data, high- 
lights high similarities. The simulations from the normal-gamma model recover subtile differences between 
background correction procedures, whereas simulations from the normexp model totally fail to reproduce the 
trends noticed on spike-in data. These considerations enhance the validation of the normal-gamma model 
for lUumina microarrays, and illustrate the potential of the normal-gamma simulations for the comparison 
of pre-processing procedures. Furthermore, the similarities between the observations from spike-in and sim- 
ulated data are increased by sampling the background noise from the empirical negative probe distribution 
which suggests that an improvement in modeling could be brought by a non-normal parametrisation of the 
background noise. 

In conclusion, this paper addresses the lack of fit of the usual normal-exponential model by proposing a 
more flexible parametrisation of the signal distribution as well as the associated background correction. This 
new model proves to be considerably more accurate for lUumina microarrays, but our results indicate that the 
improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, 
this realistic modeling makes way for future investigations, in particular to examine the characteristics of 
pre-processing strategies. 



6 Figures and Tables 
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Figure 1: Normal-Exponential estimation for one array from (£^i) after removal of imperfectly designed 
probes: irregular density histogram of all regular probe intensities and the plug-in normexp density of the 
regular probes with MLE, RMA and NP parameter estimates. 
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Figure 2: Normal-Gamma estimation for one array from (Ei) after removal of imperfectly designed probes: 
irregular density histogram of all regular probe intensities, plug-in normexp density with RMA estimate and 
plug-in normal-gamma density with MLE estimate. 





Human 
(all probes) 


Human 
(remove bad probes) 


Mice 


nexp MLE 


7.09 


5.14 


4.83 


nexp RMA 


2.96 


3.18 


2.71 


nexp NP 


7.69 


5.50 


5.29 


Abs Dev normgam 


0.17 


0.21 


0.20 



Table 1: Average deviation between normexp reconstructed density and histogram divided by the deviation 
between normal-gamma reconstructed density and histogram (First row: RMA estimator; second row: MLE 
normexp estimator; third row: NP normexp estimator). The fourth row gives the mean deviation of the 
normal-gamma estimator as a reference. The mean is computed over the ten arrays from (Ei) with (first 
column) or without (second column) the non specific binding probes and over the four arrays from {E2) 
(third column). 
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Figure 3: Logarithm of the Absohite Deviation of estimated signal on raw scale (first row), Absolute Deviation 
of log-transformed estimated signal (second row) and signal log-density (third row). Normal-gamma BgC 
(purple) and normexp BgC with MLE (pink), RMA (blue) and NP (green) parameters. 
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Figure 4: Operating characteristics of the BgC niethods on spike-in and simulated data. Row 1: average 
spike intensities (left) and standard deviation of spike replicates (right) for all non-zero spike concentrations. 
Row 2 to 4: average intensity (left) and standard deviation of replicates (right) as a function of signal 
intensity. Row 2: normal-gamma simulation in data set (S3) (parameter set 7); Row 3: gamma signal and 
empirical background noise distribution (data set (<^4^); Row 4 normal-exponential simulation in data set 
{S3) (parameter set 9). 




Offset 



Figure 5: AUG as a function of added offset. AUG from moderated t-test for mixed sample differential 
analysis in data set {E4) (proportion 25%/75% and 75%/25%) for different values of offset. 







(7 


k 


6 


set 1 


7.1E-4 


5.6E-3 


9.3E-3 


1.7E-2 


set 2 


1.3E-3 


5.5E-3 


I.Oe-2 


1.8E-2 


set 3 


3.5E-3 


1.6E-2 


6.9E-3 


8.3E-3 


set 4 


4.5E-3 


1.3E-2 


8.9E-3 


9.8E-3 


set 5 


2.1E-3 


7.6E-3 


2.6E-2 


1.7E-2 


set 6 


3.5E-3 


7.2E-3 


3.9E-2 


2.4E-2 



Table 2: Relative Li-error for each parameter in the normal-gamma model using MLE estimates. 



7 Appendix: supplementary material 

7.1 Experimental and simulated data sets 
7.1.1 Laboratory methods for NOWAC data set {Ei) 

Data set {Ei ) consists of the gene expression profiles in peripheral blood of ten controls from the Norwegian 
Women And Gancer cohort. The gene expression profiles were measured on source cells using lUumina Human 
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H <j 6 



set 3 


1.1 


1.0 


1.6 


set 4 


1.2 


1.0 


1.8 


set 5 


2.1 


1.0 


2.3 


set 6 


1.9 


1.0 


1.9 



Table 3: Error in parameter estimation. Ratio between the relative Li errors of the MLE estimation in the 
normal-gamma and in the normexp models for (/U, a, 6) from normexp data. 



{li, a, k, 9) 


i?(l) 


R{2) 


i?(3) 


i?(4) 


R{5) 


MAD(5(°)) 


set 1 


1.00 


4.16 


1.77 


1.52 


1.16 


2.34 


set 2 


1.00 


4.10 


1.90 


1.66 


1.20 


11.7 


set 3 


1.00 


1.00 


4.69 


1.00 


1.00 


4.57 


set 4 


1.00 


1.00 


3.71 


1.00 


1.02 


31.4 


set 5 


1.00 


1.00 


2.11 


1.00 


1.15 


2.95 


set 6 


1.00 


1.00 


1.46 


1.00 


1.35 


17.2 



Table 4: Excess risk ratio of background corrected raw-scale intensities. Mean Absolute Deviation (MAD) of 
the background corrected intensities for methods S^^\ j = 1, . . . , 5 divided by the MAD for the theoretical 
normal-gamma BgC with the true parameters (method 5^°^), from the simulation data set (Si). Column 
1: normal-gamma, column 2: normcxp-MLE. column 3: normcxp-RMA, column 4: normcxp-NP, column 5: 
background subtraction. The MAD of the theoretical normal-gamma deconvolution with the true parameters 
is given as reference in column 6. 

HT-6 v4 Expression BeadChip (Illumina, San Diego, CA), which enables genome- wide expression analysis 
(more than 48 000 transcripts) of six samples in parallel on a single microarray. A restricted set of more 
than 25,000 probes is also considered by removing non-reliable probes according to lUumina's annotation 
files. The Illumina TotalPrep RNA amplification Kit (Ambion Inc., Austin, TX) was used to amplify UNA 
for hybridization on Illumina BeadChips. To synthesize the first strand cDNA by reverse transcription, we 
used totalRNA from each sample collected above. Following the second strand cDNA synthesis and cDNA 
purification step, the in vitro transcription to synthesize cRNA was prepared overnight within 12 hours. The 
microarray service was provided by NMC-NTNU, a Norwegian national technology platform supported by 
the functional genomics program (FUGE) of the Research Council of Norway. 

7.1.2 Simulation of microarray data 

Simulation experiment (6*1). Microarray data arc simulated from both normal-gamma and normexp 
models. In order to get realistic values of the parameters, six sets of parameters are computed by applying 
normal-gamma MLE, as well as normexp MLE and RMA estimation methods to observed intensities from 
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in,a,k,e) R(l) R(2) R(3) R(4) 



set 


1 


1.00 


1.32 


1.18 


1, 


.17 


set 


2 


1.00 


1.28 


1.16 


1, 


.16 


set 


3 


1.00 


1.00 


2.98 


1, 


.00 


set 


4 


1.00 


1.00 


2.45 


1, 


.00 


set 


5 


1.00 


1.00 


1.81 


1, 


.00 


set 


6 


1.00 


1.00 


1.39 


1, 


.00 



Table 5: Excess risk ratio of background corrected log-transformed intensities. Mean Absolute Deviation 
(MAD) of the background corrected intensities for methods S^^\ j = 1, . . . , 5 divided by the MAD for the 

theoretical normal-gamma BgC with the true parameters (method S**-"^), from the simulation data set (Si). 
Column 1: normal-gamma, column 2; normexp-MLE, column 3: normexp-RMA, column 4: normexp-NP. 



BgC 


Innate offset 


Stand. Dev. 


Slope 


normexp MLE 


23.4 


0.095 


0.74 


normexp NP 


12.4 


0.100 


0.80 


normexp RMA 


6.9 


0.110 


0.86 


normal-gamma 


1.5 


0.200 


0.99 



Table 6: Innate offset, average standard deviation of spike replicates and slope of the linear regression of the 
spike average intensity on the log-concentration. 





Normal- 


Normexp- 


Normexp- 


Normexp- 




gamma 


MLE 


RMA 


NP 


250ng 


0.9778 


0.9812 


0.9813 


0.9820 


lOOng 


0.9774 


0.9807 


0.9809 


0.9808 


50ng 


0.9805 


0.9834 


0.9832 


0.9841 


lOng 


0.9782 


0.9818 


0.9787 


0.9816 



Table 7: AUC from Spearman correlation test between the proportion and the intensity in the dilution data 
set (-B5), for the four BgC methods, and the four RNA starting concentrations. 
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one microarray from {Ei) and {£2). The values of the six sets of parameters are summarized in Table A 
(sets 1-6). For each set of parameters, N = 100 random arrays are generated under the normal-gamma 
model, with Ureg = 25000 regular probes and rineg = 1000 negative probes. For each repetition ^ = 1, . . . , A'", 
two independent samples and X°'^, corresponding respectively to the expression levels of the regular and 
negative probes, are generated: 

. = = 5| + B|, j = 1, . . . , n,eg} with 

{Sj,j = 1, . . . , rireg} independent identically distributed (i.i.d.) sample from a gamma distribution 
with shape k and scale 9. 

{Bj,j = 1, . . . ,nreg} i.i.d. sample from a normal distribution with mean ^ and variance a^. 

• X°'^ = {Bj'^,j = 1, . . . ,nneg} i-i-d. sample from normal distribution with mean fj, and variance a^. 

Simulation experiment (52). The procedure is similar to (^i), but the background noise and the 
negative probe intensities are generated from a mixture distribution 

(1-p)AA(m(t)+Px'(3,55) 

where x^(3, 55) is a x^-distribution with 3 degrees of freedom and a non-centrality parameter equal to 55, 
for p in (0.1, 0.25, 0.5, 0.75, 1). The values of the normal-gamma parameters are given in Table A, set 1. 

Simulation experiment {S3). The parameter values for this experiment, computed from one array 

in experimental data set (-E3) are given in Table A: set 7 is computed from the normal-gamma model, and 
sets 8 and 9 are estimated from the normexp model with MLE and RMA parameter estimates. In order to 
mimic replicates from the same biological sample, the vector of signal intensities S, drawn from a gamma 
distribution is identical on the N arrays. Then, for each repetition £ = 1, . . . ,N, two independent samples 
and X°'^ corresponding respectively to the intensities of the background noise on regular probes and the 
intensities of negative probes are generated. 

• X^ = = Sj+B^.,j = l,..., n,eg} with 

{^j^j = 1) • • • ) ^reg} = S. 

{Bj,j = 1, . . . ,nreg} i.i.d. sample from a normal distribution with mean /x and variance cr^. 

• X°'^ = {Bj'^,j = 1, . . . ,nneg} i-i-d. sample from a normal distribution with mean /it and variance cr^. 

Finally, X^ and X°'^ stand for the regular and negative probe intensities on array i. Note that the 
parameter sets 8 and 9 with a shape value of 1 correspond to an exponential distribution. 

Simulation experiment (5*4). This data set is simulated based on a gamma distributed signal and a 
non-normal background noise. In order to get a realistic distribution of the background noise, the negative 
probe intensities from the 48 arrays in experimental data-set (-E3) arc quantile-normalizcd independently 
from the regular probes, and gathered in a vector Dneg. The signal identical over all arrays is simulated from 
a gamma distribution with parameters k and 6 from set 7. Then, for each array £, two vectors and X°'^ 
of length rirog and rinog standing for the background noise and the negative probe intensities are sampled 
with replacement from Dneg. 
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a 


k 


e 


Exp. data set 


est. method 


set 1 


53 


4.4 


0.12 


1785 




/ng + MLE 


set 2 


138 


24 


0.11 


4949 




/"s + MLE 


set 3 


43.5 


5.8 


1 


226 


(El) 


ynexp ^ ]^LE 


set 4 


170 


41 


1 


505 


{E2) 


jnexp _^ ]y[Lg 
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52.8 


5.0 


1 


8.33 


(El) 


jnexp _j_ J^MA 


set 6 


223 


37 


1 


33.8 


{E2) 


ynexp ^ j^]y[^ 


set 7 


93 


11 


0.08 


3230 


(Es) 


/ng + MLE 


set 8 


69 


13 


1 


277 


(Es) 


/nexp ^ ]y[LE 


set 9 


92 


14 


1 


10.5 


(Es) 


^nexp ^ j^]y[^ 



Table A. The nine sets of parameters used in the simulations by using on array {E2) and {E3) with 

the three methods of estimation. The k = I value corresponds to a normexp model with an exponential 
distribution with mean 9. 

7.2 Fit of normexp and normal-gamma model on experimental data sets (£'1) 
and {E2) 

7.2.1 Fit of normal model on negative probe distribution 

On several arrays from (Ei) and {E2), we have compared the density histogram of negative probes to the 
plug-in normal density obtained by using robust estimators of the parameters {n, a): 

J /i = mcdian({Xj, 7 € Jo}) 

\ = mcdian({|Xj - A^l , j G Jo})/0.6745. 

The results of this comparison, similar through the arrays, are illustrated in Figure B on one array from 
(£^1) and (-E2). This figure presents a regular density histogram of the negative probe intensities as well as 
the plug-in normal density. The empirical distribution is essentially normal but we notice a slightly heavier 
right tail. 

7.3 Fit of the normexp and normal-gamma models. 

Figure B presents the fit of the normal-gamma and normexp models with various parameter estimates on 
two arrays from (£^1) and (£2)- 
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Figure A. Regular density histogram of the negative probe intensities and plug-in normal density using a 
robust estimate of the parameters (red dotted line) built on the negative probes. Left: one array from 
right: one array from (£2). 



7.4 Quality of the signal estimation at the log-scale on the data set (5*3) (set 7) 

For each background correction procedure S"*^*^ i = 0, . . . , 4 described in Section 6.2.4, we compute the Mean 
Absolute Deviation of the bakground corrected log-transformed intensities over N = 100 repetitions. Note 
that the background corrected values after background subtraction include negative values and therefore can 
not be log-transformed. For each parameter set 1 to 6 and for each BgC methods S — S*^*^, i = 0, . . . ,5 the 
MAD on the log-transformed intensities is defined as: 

1^/1 "'"^ ^ ^ \ 
MAD(log(5)) = - ^ — ^ |log(^(Xj|e,)) - log(5j) 

i=i \ '•°g j=i J 
For each method 5^*', we compute the excess risk ratio: 

R{i) ^ MAD(bg(S'(*)))/MAD(bg(S'("))), for i = 1, ... ,5 
The results are displayed in Table 5. 

7.5 Robustness with respect to non-normal background noise. 

We check the robustness of the normal-gamma BgC with respect to heavier right tails in the negative probe 
distribution from the simulation data set (S'2), and compare the results obtained with normexp-NP. Figure 
D presents the density of the mixture distribution used for checking robustness. Table B presents the MAD 
computed from the normal-gamma BgC (first column) and the normexp BgC with NP estimation (second 
column) . 
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Figure B. Normal-gamma and normal-exponential fit on one array from (Ei) with the full set of regular 
probes (first row) and one array from {E2) (second row). On each figure the irregular density histogram 
of the regular probe intensities is displayed. On left column are presented the plug-in normexp density 
with MLE, RMA and NP parameter estimates, and on the right column the plug-in normal-gamma density 
estimator, together with the normexp-RMA estimator. 
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p 


MAD(S'(i)) 


MAD (5^(4)) 





2.38 


3.59 


0.10 


3.12 


4.88 


0.25 


4.19 


6.51 


0.50 


5.70 


8.72 


0.75 


6.82 


10.53 


1 


7.59 


12.01 



Table B. MAD of background corrected intensities from simulated data with background distribution defined 
as a mixture (1 — p)A/'(50,4) +px^(3, 55) and signal is generated from a r(0.12, 1785). 





Figure C. Density of the mixture distribution (1 — p)A/'(53, 4.4) + px^(3, 55) for various values of p. 
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7.6 Operating characteristics 



7.6.1 Equal Innate offset on spike-in data set (E^) 

An offset has been added to the regular probe intensities prior to log-transformation for each method, in 
order to equalize innate offsets. The results displayed in Figure D corresponds to a total offset equal to 
100 (i.e. an added offset of 98.5 for the normal-gamma BgC, 76.5 for normexp-MLE, 93 for normexp-RMA 
and 87.6 for normexp-NP). The first column presents the average intensity, the second column displays the 
estimated log-ratio for each pair of consecutive concentrations, corresponding to a true fold-change of 3 or 
3.33.. The thirs column shows the standart deviation between replicates. 

We observe that the precision is globally similar, but the difference on small concentration is cnlighted by 
the observation of the fold-change for every pair of consecutive concentrations. We note that normexp-MLE 
provides a smaller bias (larger log-ratios) and a poorer precision. 
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Figure D. Average spike intensity (left), average log-ratio for every pair of consecutive spike concentrations 
(center), and average standard deviation between spike replicates (right) as a function of the spike log- 
concentration after equalisation of the innate offsets on raw data. 



7.6.2 Operating characteristics on normal-exponential simulated data 

The operating characteristics of the four BgC methods on the data set (5*3) with parameter set 8 are displayed 
in Figure F. 
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Figure E. Average log-intensities (left) and log-variance between replicates (right) for simulation data set 
{S3) with parameter set 8. 



7.7 Numerical implementation 

7.7.1 Expression of the normal-gamma density 

The convolution product of the normal and gamma density: 

does not have an analytic expression similar to the normal-exponential density. Nevertheless, an expression 
involving the Kummer's special mathematical function may be obtained by using the software Mathematica. 
Denoted by kum the M-Kummer's special function, 

r (I) kum (i i ^) - V2yr (i^) kum f, 

where y = (cr^ + 6'(/i — x))/{(t6). The R-package f AsianOptions implements the M-Kummer's special 
function. Unfortunately, it does not offer enough numerical stability to be used for microarray expressions, 
due to their range. 

7.7.2 Computation of the normal-gamma density estimator by f ft. 

The normal gamma density fx = /"^ k e computed with the Fast Fourier Transform (f f t) function on a 
regular grid of [0, T] where T = fj, + 5a + q and q is the gamma distribution 0.99999-quantile, then interpo- 
lated on other points. 

Approximation of fx- Let tpx denote the moment generating function of X, 

(px{t)^E[e''^], yteR. 

Then 

^x{t) = ^s{t)vB{t) = (1 - iter'^e^'^'e-^"'"^'. 

Moreover, for every a; G M, 
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/oo 
<fix{t)e-''^dt 
-OO 



i=o 

exp ^— ifx + ~ (Riemann sum) 

for large A and AT and small ratio A/N. Let us denote by /^^ this approximate. 

Computation using f ft 

The f f t function of R is defined as follows. For a vector V of length N, 

N 



fft(V) = ^ Vblexp — (j - 1)(A: - 1) 



Let 



and 

then for fc = 0, ...,iV- 1, 

W[fc] 



V=j{0:N-l) 

/ 2A 
y = ^^l-A+—iO:N-l) 



W = exp(aU)f ft(V) 



Nw 



eMi^k - l))J2vx (-A + —{j - 1)] 

.7=1 ^ ^ 



exp(-^(i-l)(fc-l))=7^(^). 



7.7.3 Normal-gamma MLE computation. 

The parameters (/i, a, k, 6) of the normal-gamma model are estimated by likelihood maximization, ensured 
by the R- function optimx with the following initialization values. 

/zo = mean(Xj, j e Jo), 

(70 = /Qi?(X,-,jG Jo)/1.349, 

6*0 = (sd{Xj,j e Jf - al)/ (mean(Xj-, j e J) - /^o), 

fco = (mean(Xj, j e J) - Ho)/0o, 

where /Qii designs the interquartile range and the new parametrization 

(Pi,P2,P3,P4) = {n,a,k9,eVk), 

in order to get homogenous parameters. 
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7.8 Infer the negative probe intensities from Illumina detection p-values 
7.8.1 Algorithm 

For a given array, let X and N be the vectors of regular and negative probe intensities, and P the vector 
of detection p-values. For each regular probe j with intensity Xj, the detection p- value Pj is equal to the 
proportion of negative probes which intensities are larger than Xj-. 

Pj = —Card{k,Nk > Xj} 

^neg 

where rineg is the number of negative probes. Let Q be the vector of ordered unique values of P, and denote 
by e{Q) its length. For every fc e {2, . . .,i{Q)}, let: 

xl = ma,x{X j,Pj =Qk} 
xl = mm{X j,Pj = Qk+i} 

then 

k 

Qk+i — Qk = 

^neg 

where d is the number of negative probes whose intensity lies in [x\, a;^]. Therefore, for every A; G {2, . . . , ^{Q)}, 
we infer d negative probes with intensity equal to 

^{x\ + x^). 

Moreover, if min((5) = 0, we infer a negative probe with intensity max{Xj,Pj = 0} and if min((3) = 
do/^neg, we infer do probes with intensity equal to mm{X). 



7.8.2 Performances 

For the ten human microarrays. wc have computed the parameters 1) with the true negative probe intensities, 
2) with the set of inferred negative probe intensities, for the normal-gamma model and for the normexp model 
with MLE and NP parameters (the RMA algorithm do not uses the negative probe intensities). Table C 
presents the relative error on parameters and on background corrected intensities. 



















Method 






a 


k 


( 


9 


S 


Normal-gamma 


2.9 E-5 


5.2 E-4 


7.9 E-4 


4.2 


E-4 


6.3 E-4 


normexp-MLE 


4.7 E-6 


5.t 


i E-5 




7.2 


E-4 


2.1 E-5 


normexp-NP 


1.4 E-5 


l.i 


i E-4 




7.2 


E-6 


1.3 E-4 



Table C. Relative error of estimation from replacing negative probe intensities by inferred values. Column 
2-5: parameters of the model; Column 6: background corrected values. 
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